Empirical Research on Software Maintenance: 1981-1990


Abstract 


Despite  its  economic  importance,  the  activity  of  software  maintenance  is  relatively 
under-studied  by  researchers.  This  comprehensive  survey  documents  that  only  two 
percent  of  all  articles  appearing  in  three  leading  journals  and  two  refereed  conferences 
over  the  past  decade  were  devoted  to  empirical  studies  of  software  maintenance.  The 
primary  purpose  of  this  paper  is  to  document  "what  is  known"  from  this  research,  and  to 
suggest  future  avenues  of  research.  The  sixty-one  articles  surveyed  are  conveniently 
summarized  as  to  major  differences  and  similarities  in  a  set  of  detailed  tables.  The  text  is 
used  to  highlight  major  findings  and  differences.  Although  the  emphasis  of  the  paper  is 
on  the  subject  matter,  a  section  discussing  appropriate  research  methodologies  is  included 
as  a  guide  to  researchers  new  to  this  area. 
1.  INTRODUCTION

1.1  Why  Empirical  Studies  of  Software  Maintenance? 

While  much  is  written  about  new  tools  and  methods  for  developing  new  software,  a 
significant percentage^ of professional software engineers' time is spent maintaining
existing  software.   Software  maintenance  represents  a  large  and  growing  expense  for 
organizations^.   In  addition,  due  to  the  shortage  of  experienced  software  engineers,  the 
preponderance  of  maintenance  work  represents  an  opportunity  cost  of  those  resources  that 
would  otherwise  be  devoted  towards  developing  new  systems.   Therefore,  software 
maintenance  represents  an  activity  of  considerable  economic  importance  and  is  a 
candidate  for  academic  research. 

As  an  aid  to  researchers  interested  in  maintenance  or  maintenance-related  issues,  this 
paper  surveys  the  past  decade's  empirical  studies  of  software  maintenance.   The  focus  on 
empirical  studies  was  deliberately  chosen  due  to  the  relative  newness  of  the  field.   Unlike 
more  mature  disciplines,  this  area  does  not  yet  have  a  large  body  of  well-accepted  theory 
upon  which  to  build.   Therefore,  the  primary  early  gains  have  been  made  in  careful 
observation  of  maintenance  activities  through  empirical  studies.    The  intent  of  the  survey 
is  to  collect,  classify,  and  analyze  the  existing  body  of  work,  with  special  attention  paid  to 
identifying  those  issues  where  further  research  would  appear  to  be  most  beneficial. 

1.2  Scope  of  the  Review 

Schneidewind, in his guest editor's introduction to a special issue on software
maintenance in the March 1987 issue of IEEE Transactions on Software Engineering
(IEEE-TSE), noted that there was not a single article on maintenance in IEEE-TSE over a past
period of a little more than a year. In addition, a preliminary exploration of two years of archival
journals revealed a general dearth of empirical work in software maintenance. Therefore,
the  scope  of  this  review  was  set  to  comprehensively  examine  the  leading  archival  journals 
and  refereed  conference  proceedings  over  the  past  decade.   The  choice  of  publication 


^Various  studies  have  noted  that  maintenance  is  estimated  to  comprise  from  50-80%  of  software  development 
activities.  Some  of  these  are  summarized  in  [Kemerer,  1987]. 

^For example, it has been estimated that 60 percent of all business expenditures on computing are for
maintenance  of  software  written  in  COBOL  alone  [Freedman,  1986]. 


outlets  included  three  journals  and  two  conference  proceedings.    These  five  outlets 
published 3,018 papers in the time span of the survey. The three journals were:

•  Communications  of  the  ACM  (CACM) 

•  IEEE Transactions on Software Engineering (IEEE-TSE)

•  Journal of Systems and Software (JSS)

Communications of the ACM is the largest-circulation journal of the
Association for Computing Machinery, a leading professional society. Due to its wide
circulation  and  monthly  format,  it  provides  a  large  number  of  highly  visible  pages  within 
which  to  publish  refereed  articles.   It  has  also  scored  highly  on  subjective  rankings  of 
"journal  quality",  which  contributes  to  its  attractiveness  as  a  publication  outlet  for 
scholars^.  IEEE  Transactions  on  Software  Engineering  is  also  a  well-regarded  monthly 
journal  which  is  focused  on  software  engineering  topics.    The  Journal  of  Systems  and 
Software  is  another  frequent  source  of  software  engineering-related  articles.    It  currently 
has  nine  issues  per  year,  although  there  are  plans  to  expand  to  a  monthly  format. 

The  refereed  conferences  chosen  were: 

•  IEEE  International  Conference  on  Software  Engineering 

•  IEEE  Conference  on  Software  Maintenance 

The IEEE International Conference on Software Engineering is a well-regarded refereed
conference focused on software engineering topics. The IEEE
Conference  on  Software  Maintenance  was  an  obvious  choice  given  the  topic. 

There are arguably other journals that could be included on such a list.^ However,
the fact that this set alone generated over 3,000 possible articles to review implied that this
sample would be likely to capture most of the important papers in this area. In
addition,  while  the  statistics  and  tables  included  below  are  limited  to  those  papers  that 
appeared  in  one  of  those  five  sources,  a  few  widely-cited  papers  that  have  appeared 
elsewhere  that  are  relevant  to  this  review  have  also  been  included  in  the  discussion. 


^For example, in an unpublished survey by two computer information systems researchers at New York
University of the top journals ranked by computer information systems faculty, CACM ranked third, and
IEEE-TSE ranked fifth. JSS was not included in the study and therefore was unranked [Ramesh and Stohr, 1989].

^There is a relatively new journal from Wiley called the Journal of Software Maintenance. It was not included
in this review due to its relative absence during the period surveyed, but would be a logical choice for a review
spanning the next decade.
The criteria for inclusion in this set were that the paper had to present and analyze
empirical  data  relating  to  software  maintenance.   This  research  adopts  the  ANSI/IEEE 
standard  729  definition  of  maintenance:  "Modification  of  a  software  product  after  delivery 
to  correct  faults,  to  improve  performance  or  other  attributes,  or  to  adapt  the  product  to  a 
changed  environment"  [Schneidewind  1987].    Empirical  research  on  software  maintenance 
has  much  in  common  with  research  on  new  software  development,  since  both  involve 
the  creation  of  working  code  through  the  efforts  of  human  developers  equipped  with 
appropriate  experience,  tools,  and  techniques.    However,  software  maintenance  involves  a 
fundamental  difference  from  development  of  new  systems  in  that  the  software  maintainer 
must  interact  with  an  existing  system  [Swanson  and  Beath  1989b]. 

Some  of  the  research  included  herein  overlaps  the  areas  of  both  maintenance  and 
development.   One  example  is  that  there  is  evidence  to  suggest  that  development 
decisions,  such  as  the  use  of  structured  programming  techniques,  are  expected  to  have  a 
noticeable  effect  on  later  maintenance  efforts.    Another  example  is  that  it  has  been  noted 
that  the  cost  of  correcting  program  errors  typically  increases  significantly  the  later  they  are 
discovered,  suggesting  that  extra  effort  in  the  development  phase  will  reduce  maintenance 
costs  [Shen  et  al.  1985].   Complexity  metrics  are  another  area  of  study  that  applies  to  both 
development  and  maintenance.   To  account  for  these  sorts  of  overlaps,  a  study  or 
experiment  did  not  have  to  specifically  address  maintenance  issues  in  order  to  qualify  for 
inclusion,  but  was  required  to  provide  insight  that  could  be  readily  applied  to 
maintenance.   It  is  suggested  that  this  review  may  therefore  be  broadly  useful  to 
researchers  in  new  software  development  who  may  also  benefit  from  familiarity  with  this 
work. 

The  identification  of  articles  suitable  for  inclusion  was  done  manually  through  the 
review  of  titles  and  abstracts  of  individual  articles  in  each  publication  and  then  a  reading 
of  the  full  article  for  those  which  initially  appeared  to  be  appropriate.   Eighty-three  articles 
were  originally  identified  as  candidates,  and  of  these,  sixty-one  were  ultimately  found  to 
meet  the  criteria  [Ream  1991].  This  approach  to  selecting  articles,  of  course,  leaves  open  the 
possibility  that  some  may  have  been  inadvertently  omitted.^    To  reduce  the  probability  of 
this  type  of  error,  a  check  of  the  selected  titles  was  made  against  an  existing  bibliography  of 
empirical  software  maintenance  research  that  was  published  in  the  1988  IEEE  Conference 
on  Software  Maintenance  [Hale  and  Haworth  1988].   All  of  the  articles  cited  there  are 


^Some  difficult  conscious  omissions  were  made  as  well.  For  example  a  1982  CACM  article  by  Elshoff  and 
Marcotty  addresses  many  items  of  interest  to  maintenance.  However,  it  does  not  present  and  analyze  a  new  set 
of  empirical  data,  but  rather  relies  on  a  set  of  constructed  examples. 



included in this survey, as well as approximately forty additional articles that were not
included  on  that  list.    Thus,  although  inadvertent  omissions  may  remain,  this 
compilation  is  believed  to  be  representative  of  empirical  software  maintenance  over  the 
last  decade. 

One  of  the  first  findings  from  this  review  is  the  relative  scarcity  of  empirical  work  in 
software  maintenance.   A  total  of  sixty-one  articles  out  of  3,018  were  found  to  meet  the 
criteria  set  out,  approximately  2%  of  the  total  (See  Table  1).  This  scarcity  of  research 
confirms  the  earlier  but  less  systematic  observation  of  Schneidewind.    Even  allowing  for 
inadvertent  omissions,  the  percentage  of  effort  devoted  to  this  type  of  work  in  software 
maintenance,  as  reflected  by  its  publication  in  scholarly  outlets,  seems  far  below  what  its 
practical  importance  would  seem  to  warrant.   This  neglect  of  software  maintenance  as  a 
research  area  should  concern  practitioners,  since  little  effort  is  being  devoted  to 
discovering  new  knowledge  about  an  activity  of  considerable  economic  importance. 
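
The proportions quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text (83 candidate articles, 61 selected, 3,018 published); the function name `share_percent` is an illustrative choice and not part of the survey.

```python
# Counts taken from the survey text.
TOTAL_PUBLISHED = 3018   # papers published by the five outlets
CANDIDATES = 83          # articles initially identified as candidates
SELECTED = 61            # articles that met the inclusion criteria


def share_percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Share of all published papers that were empirical maintenance studies.
print(round(share_percent(SELECTED, TOTAL_PUBLISHED), 2))   # 2.02
# Share of candidate articles that survived the full-text reading.
print(round(share_percent(SELECTED, CANDIDATES), 1))        # 73.5
```

The first figure corresponds to the "approximately 2%" reported in the text; the second is not stated in the survey and is shown only as a derived quantity.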

A  related  concern  may  be  that  there  is  no  clear  trend  for  more  work  in  this  area. 
Figure  1  shows  both  the  raw  frequency  counts  by  year  plus  a  cumulative  average.  The  raw 
counts  may  be  somewhat  misleading,  given  the  irregular  publication  cycles,  particularly  of 
the  IEEE  Conference  on  Software  Maintenance.    However,  there  is  no  strong  trend  in  the 
average,  which  suggests  that  Schneidewind's  call  for  more  work  in  this  area  has  not  been 
acted  upon. 
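
For readers who wish to reproduce this kind of trend line, a cumulative average of yearly counts can be computed as sketched below. The yearly counts in the example are hypothetical placeholders, not the survey's data, and the function name is an assumption made for illustration.

```python
from itertools import accumulate


def cumulative_average(counts):
    """Running mean: element i is the mean of counts[0] through counts[i]."""
    running_totals = accumulate(counts)
    return [total / (i + 1) for i, total in enumerate(running_totals)]


# Hypothetical yearly article counts (NOT the figures behind Figure 1).
example_counts = [2, 4, 6, 8]
print(cumulative_average(example_counts))   # [2.0, 3.0, 4.0, 5.0]
```

A cumulative average smooths the effect of years in which a conference was not held, which is why it is a more stable indicator of trend than the raw counts.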


[Figure 1: Frequency of articles by year. Chart title: "Empirical Studies of Software Maintenance: Frequency by Year". Series: Number; Moving Average.]
1.3  Organization of the paper

Despite  the  small  percentage  of  articles  discovered,  61  articles  form  an  amount  of 
material  sufficiently  large  that  some  structure  needs  to  be  imposed  in  order  to  properly 
convey  the  contributions  made  by  each  study.   The  approach  adopted  here  was  to  briefly 
summarize  the  contributions  of  every  article  in  table  form,  and  then  expand  on  these 
comments  in  the  text  for  a  subset  of  those  articles  that  merited  additional  discussion.   The 
papers  are  organized  under  three  broad  areas,  with  one  area  subdivided  into  two  more 
focused  parts.   These  areas,  with  the  number  of  articles  in  parentheses,  are: 

•  Software  Complexity  Measurement 

••        Modularity  and  Structure  Metrics  (15) 
••        Other  Complexity  metrics  (16) 

•  Comprehension  (15) 

•  General  Maintenance  Management  (15) 

The  format  of  the  tables  includes  the  following  data: 

•  Author,  year 

•  Publication  in  which  the  article  appeared 

•  Methodology (field studies, experiments, and surveys; lab studies and experiments)

•  Data  source 

•  Dependent  variable 

•  Statistical  test(s)  employed,  if  any 

•  Brief  summary  of  key  results 

The  tables  are  additionally  designed  to  assist  readers  interested  in  narrower  topics,  e.g., 
"COBOL  programming"  or  "laboratory  experiments  involving  students". 

The  remainder  of  this  paper  is  organized  as  follows.   The  next  section,  "Software 
Complexity  Measurement"  presents  work  whose  primary  contribution  lies  in  the 
relationship  between  complexity  measurement  and  software  maintenance  results.    Section 
3,  "Comprehension"  focuses  on  research  whose  primary  interest  is  in  how  maintainers' 
comprehension  of  existing  software  can  be  improved.    All  other  topics  are  summarized  in 
section  4,  "General  Maintenance  Management"  and  section  5  provides  a  summary  and 


discussion  of  some  meta-issues  highlighted  by  this  review  of  the  previous  decade's  worth 
of  software  maintenance  research.    A  final  section  provides  some  concluding  remarks. 

2.  SOFTWARE  COMPLEXITY  MEASUREMENT  RESEARCH 
2.1  Introduction 

Research  in  this  area  is  generally  focused  on  the  relationship  between  a  complexity 
measure  and  maintenance  effort,  or  among  complexity  measures.    Basili  defines  software 
complexity  as  "...a  measure  of  the  resources  expended  by  another  system  while  interacting 
with  a  piece  of  software.  If  the  interacting  system  is  people,  the  measures  are  concerned 
with  human  efforts  to  comprehend,  to  maintain,  to  change,  to  test,  etc.,  that  software." 
(1980,  p.  232).  Curtis  et  al.  similarly  define  the  same  concept  (which  they  refer  to  as 
psychological  complexity)  as:  "Psychological  complexity  refers  to  characteristics  of  software 
which  make  it  difficult  to  understand  and  work  with"  (1979,  p.  96).    Both  of  these  authors 
note  that  the  cognitive  load  on  a  software  maintainer  is  believed  to  be  higher  when 
structured  programming  techniques  are  not  used. 

Schneidewind  estimates  that  75-80  percent  of  existing  software  was  produced  prior  to 
significant  use  of  structured  programming  (1987).    A  key  component  of  structured 
programming  approaches  is  modularity,  defined  by  Conte  et  al.  (1986,  p.  197)  as  "the 
programming  technique  of  constructing  software  as  several  discrete  parts."    Structured 
programming  proponents  argue  that  modularization  is  an  improved  programming  style, 
and  therefore,  the  absence  of  modularity  is  likely  to  be  a  significant  practical  problem.   A 
number  of  researchers  have  attempted  to  empirically  validate  the  impact  of  modularity  on 
either  software  quality  or  cost  with  data  from  actual  systems,  achieving  somewhat  mixed 
results.   (See  Table  2.) 

There is a significant amount of other work in the software complexity metrics area, for
example,  volume  measures  such  as  those  of  Halstead's  Software  Science.    (See  Table  3.) 
Work  in  this  area  often  overlaps  the  work  in  modularity  and  structure,  with  many  articles 
reporting  results  for  both.    Given  the  large  amount  of  work  in  measurement,  an  attempt 
has  been  made  to  place  an  individual  article  into  either  Table  2  or  Table  3,  but  not  both. 
Researchers  who  are  broadly  interested  in  the  issue  of  software  complexity  measurement 
and  its  relation  to  productivity  should  carefully  examine  both  tables. 

Dependent variables in this research are generally either quality related (number of
errors or defects found; sometimes number of changes is used as a surrogate) or
productivity related (effort required to make a change, time required to debug, et cetera).
This  emphasis  on  performance  evaluation  is  a  pervasive  theme  in  this  literature. 
[Table excerpt: summary of key results]

A combination of metrics from different classes (code, structure, hybrid) was found to
be a far more effective predictor of maintenance than any single metric.

The conceptually simple measures of complexity such as the number of lines of
code correlated with the development effort just as well or better than the more
sophisticated standard complexity metrics.

63% of the variability in the measurement of classical metrics is represented by only one
factor in the projects analyzed: volume.

Certain COBOL style metrics (characteristics) have significant correlations with debug
times, to some degree independent of program complexity and size.

The relative complexity metric is a "reasonable" statistical tool for identifying
programs that will require more effort to debug.

Detailed PDLs had a high degree of correlation with corresponding source code.
PDL metrics follow many of the same statistical patterns exhibited in code metrics.

In the environment used in the testing, classification trees had an average accuracy
of 79.3% for fault-prone and effort-prone components.

Programs containing GOTOs were found more likely to have errors, took longer to debug,
and had worse structure than GOTO-less programs.
2.2  Modularity and Structure

2.2.1  Module  Size 

An important, widely disseminated early piece of research on the impact of modularity
and  structure  was  by  Vessey  and  Weber  (1983).   They  studied  repair  maintenance  in 
Australian  and  US  data  processing  organizations  and  used  subjective  assessments  of  the 
degree  of  modularity  in  a  large  number  of  COBOL  systems.   In  one  data  set  they  found  that 
code  with  greater  modularity  (on  average,  more,  smaller  modules)  was  associated  with 
fewer  repairs;  in  the  other  data  set  no  effect  was  found.   These  equivocal  results  were 
unexpected  by  the  authors,  and  in  their  conclusion  they  note  "Our  results  stand  as  a 
challenge  to  some  conventional  wisdom  and   the  proponents  of  structured  programming 
(who  include  us).     We  readily  acknowledge  that  our  research  is  exploratory  and  there  are 
problems with the statistical model. Nevertheless, the results are anomalous." (1983, p. 134).

A  number  of  researchers  took  up  this  challenge.   Since  Vessey  and  Weber  focused  on 
repair  maintenance,  many  follow-on  studies  have  examined  the  "number  of  errors"  as 
their dependent variable. Basili and Perricone (1984) and Shen et al. (1985) in separate
studies  found  that  larger  modules  tended  to  have  significantly  fewer  errors.    Similarly, 
Compton  and  Withrow,  in  a  recent  examination  of  263  Ada  packages,  found  that  smaller 
packages  had  a  disproportionately  high  share  of  the  errors.   A  study  by  An  et  al.  (1987) 
analyzed  change  data  from  two  releases  of  UNIX.   They  found  that  the  average  size  of 
unchanged modules (417 lines of C) was larger than that of changed modules (279 lines of
C). Unfortunately, they did not provide any analysis to determine if this difference was
statistically  significant. 

However,  other  studies  that  have  appeared  elsewhere  have  suggested  that  some  degree 
of  modularity  is  necessary.    Korson  and  Vaishnavi  (1986)  conducted  four  experiments 
comparing  the  time  required  to  modify  two  alternative  versions  of  a  piece  of  software,  one 
modular  and  one  monolithic.    In  three  of  the  four  cases  the  modular  version  was 
significantly  easier  to  modify.   Therefore,  a  newer,  alternative  hypothesis  is  that  modules 
that  are  either  too  large  (undermodularization)  or  too  small  (overmodularization)  are 
unlikely  to  be  optimal.   For  example,  Conte  et  al.  (1986,  p.  109)  note  that:  "The  degree  of 
modularization  affects  the  quality  of  a  design.     Overmodularization  is  as  undesirable  as 
undermodularization."    It  is  a  common  general  belief  that  large  modules  are  more  difficult 
to  understand  and  modify  than  small  ones,  and  maintenance  costs  will  be  expected  to 
increase  with  average  module  size.   If  the  modules  are  too  large  they  are  unlikely  to  be 
devoted to a single purpose. However, research has clearly shown that a system can be
composed  of  too  many  small  modules.    If  the  modules  are  too  small,  then  much  of  the 
complexity  will  reside  in  the  interfaces  between  modules  and  therefore  they  will  again  be 
difficult  to  comprehend.   Interfaces  are  relevant  because  they  have  been  shown  to  be 
among  the  most  problematical  components  of  programs  [Basili  and  Perricone  1984]. 
Therefore,  complexity  could  decrease  as  module  size  increases.   Some  recent  work  has 
suggested  that  a  U-shaped  function  is  likely,  with  an  optimal  module  size  that  lies  between 
the  extremes  noted  by  earlier  research  [Banker  et  al.  1992]. 
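
The intuition behind a U-shaped relationship can be illustrated with a deliberately simple model. The sketch below assumes that interface overhead per line falls in proportion to 1/size while within-module difficulty rises linearly with size. This functional form, the parameter names, and the numeric values are assumptions chosen for illustration; they are not the model estimated by Banker et al.

```python
import math


def effort_per_line(size, interface_cost, within_cost):
    """Illustrative U-shaped curve (an assumed form, not an empirical model).

    interface_cost / size : overhead from module interfaces, which is
                            spread over more lines as modules grow.
    within_cost * size    : difficulty of understanding a module, which
                            grows as the module gets larger.
    """
    return interface_cost / size + within_cost * size


def optimal_size(interface_cost, within_cost):
    """Size at which the derivative of effort_per_line is zero."""
    return math.sqrt(interface_cost / within_cost)


# Hypothetical parameters: the minimum falls at sqrt(400 / 0.01) = 200 lines.
best = optimal_size(interface_cost=400.0, within_cost=0.01)
print(best)
for size in (50, 200, 800):
    print(size, effort_per_line(size, 400.0, 0.01))
```

Under these assumed parameters both very small and very large modules cost more per line than a module of intermediate size, which is the qualitative pattern the U-shaped hypothesis predicts.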

2.2.2  Coupling 

Another  important  issue  within  this  set  of  literature  is  the  effect  of  module  coupling 
on  performance.  A  1981  study  by  Troy  and  Zweben  explored  a  number  of  hypotheses 
dealing  with  structured  programming  concepts,  including  the  notion  of  coupling.    Some  of 
the  intuition  behind  structured  programming  is  that  minimally  related  tasks  should  be 
kept  independent  by  locating  their  functions  in  separate  modules.   Independence  of 
modules  is  maximized  to  the  degree  that  coupling  among  modules  is  minimized  [Lohse 
and  Zweben  1984].  Of  all  the  hypotheses  tested  by  Troy  and  Zweben,  they  found  the 
strongest  support  for  the  notion  that  the  number  of  source  code  modifications  (a  surrogate 
for  errors)  was  positively  correlated  with  a  high  degree  of  coupling,  i.e.,  highly  cohesive 
but  loosely  coupled  modules  were  less  likely  to  require  modification. 

Continuing in this stream of research, Selby and Basili studied a large production
system for which actual error data were available (1988). They used as their independent
variable the ratio of coupling to cohesion (cohesion defined intuitively as the amount of
interaction among elements within a module), where a low value of such a ratio was
believed  to  reflect  good  structured  programming  practice.  They  found  strong  support  for 
the  notion  that  high  values  of  their  ratio  were  associated  with  higher  error  rates  and 
higher  efforts  to  correct  errors. 
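
A ratio of this kind can be sketched as follows. The operationalization here is an assumption made purely for illustration: each interaction between two program elements is counted toward cohesion when both elements belong to the same module and toward coupling otherwise. The source does not specify how Selby and Basili measured either quantity, and the function and variable names are hypothetical.

```python
import math


def coupling_cohesion_ratio(interactions, module_of):
    """Ratio of cross-module to within-module interactions.

    interactions : iterable of (element_a, element_b) pairs
    module_of    : dict mapping each element to its module name
    A lower ratio suggests more self-contained modules.
    """
    coupling = 0   # interactions that cross a module boundary
    cohesion = 0   # interactions that stay inside one module
    for source, target in interactions:
        if module_of[source] == module_of[target]:
            cohesion += 1
        else:
            coupling += 1
    if coupling == 0 and cohesion == 0:
        raise ValueError("no interactions to classify")
    if cohesion == 0:
        return math.inf
    return coupling / cohesion


# Hypothetical system: two modules, three interactions.
modules = {"parse": "A", "lex": "A", "emit": "B", "fmt": "B"}
calls = [("parse", "lex"), ("emit", "fmt"), ("parse", "emit")]
print(coupling_cohesion_ratio(calls, modules))   # 0.5
```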

Lohse  and  Zweben  note  that  there  are  multiple  dimensions  to  improving  module 
coupling,  including  the  size  and  type  of  the  information  passed  to  the  module.   They 
performed  a  lab  experiment  using  student  programmers  to  determine  whether  passing 
information  using  either  global  variables  or  parameter  lists  had  an  effect  on  the  time 
required to modify a program. They note that the literature offers conflicting advice on
this question^ and therefore it was a topic meriting experimental study. Unfortunately,
their  experiment  yielded  no  conclusive  results.   A  later  study  by  Yau  and  Chang,  however, 
found  that  use  of  global  variables  was  correlated  with  more  errors  and  changes  [Yau  and 
Chang  1988]. 

In  general,  not  enough  is  known  about  the  proper  ways  to  minimize  coupling.  This  is 
clearly  a  topic  that  merits  further  research,  particularly  in  newer  implementations,  such  as 
object-oriented  environments,  where  the  equivalent  of  coupling  needs  to  be  considered  in 
the  design  of  objects,  methods  and  classes. 

2.3  Complexity  Metrics 

Of the empirical research on software maintenance surveyed, the largest part
was devoted to software metrics, particularly those relating to aspects of software
complexity  as  defined  above.    With  only  a  few  exceptions,  the  emphasis  in  this  review  is 
on  those  studies  of  metrics  that  examined  the  relationship  between  the  metrics  and 
maintenance-related  dependent  variables,  such  as  error  rates,  time  to  locate  and  correct 
defects  (debugging),  and  number  of  subsequent  changes. 

2.3.1    Relationships  among  Metrics  and  Maintenance 

Sunohara  et  al.  simultaneously  collected  data  on  several  of  the  main  complexity 
metrics,  including  McCabe's  V(G)  and  Halstead's  E,  as  well  as  source  lines  of  code  (SLOC)^ 
for  a  medium-sized  Fortran  system  and  calculated  the  inter-metric  correlations  [McCabe 
1976]  [Halstead  1977]  [Sunohara  et  al.  1981].   For  example,  they  found  a  Pearson  correlation 
coefficient  value  for  the  pairwise  correlation  of  non-comment  SLOC  and  Halstead's  E  of 
.812 (p<.001). The implication of these strong correlations among these metrics is that a
metric  such  as  SLOC  may  be  preferable,  since  it  provides  similar  information  but  with 
greater  ease  of  collection  and  of  managerial  interpretation.   Similar  results  were  obtained 
by  Gremillion,  who  collected  multiple  metrics  for  346  PL/1  programs  [Gremillion  1984]. 
Interestingly,  his  correlation  between  SLOC  and  E  was  .82  (p<.001),  nearly  identical  to  the 
Sunohara  et  al.  study.   Gremillion's  main  finding  was  that  the  number  of  program  defects 
was  significantly  related  to  the  complexity  metrics,  and  in  particular  that  the  best  single 
predictor  metric  was  SLOC.   Essentially  the  same  results  were  found  by  Lind  and  Vairavan 


^Structured design argues that use of global variables will result in higher coupling, while complexity metrics
such as Halstead's E would indicate less coupling stemming from use of global variables [Lohse and Zweben
1984, p. 303].

^These  are  referred  to  as  "steps"  in  their  paper,  as  this  is  the  standard  nomenclature  in  Japan.  (See,  for 
example,  [Cusumano  and  Kemerer  1990].) 
in a study of a number of releases of a large medical imaging system [Lind and Vairavan
1989].   They  found  a  high  correlation  between  the  more  complex  metrics  and  SLOC,  and 
found  that  SLOC  was  the  best  single  predictor  of  number  of  "system  performance  reports" 
and  development  effort.   Clearly,  at  least  one  aspect  of  complexity  is  represented  by  the 
simple  size  metric  SLOC. 
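
The inter-metric correlations reported in these studies are ordinary Pearson coefficients. A minimal sketch of the computation follows; the per-module metric values are hypothetical and are not drawn from any of the studies cited.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-module values: a size metric and a second complexity metric.
sloc = [1, 2, 3, 4]
other_metric = [1, 3, 2, 4]
print(pearson_r(sloc, other_metric))
```

When a coefficient of this kind is close to 1 for SLOC and a more elaborate metric, the two are carrying largely the same information, which is the basis for the argument that the simpler metric may be preferable.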

There  are  two  main  conclusions  that  can  be  drawn  from  this  set  of  research.  The  first 
is  that  complexity  metrics  can  be  useful  predictors  of  the  maintenance  behavior  of  systems, 
and  that  greater  use  of  measurement  in  systems  development,  testing,  and  maintenance  is 
recommended.  The  second  conclusion  is  that  a  number  of  the  more  complex  metrics  may 
be  essentially  measuring  the  size  of  the  program  or  other  component  under  investigation, 
and therefore may provide little additional information. This may obviate their use if
these metrics are believed to be difficult to collect or implement within an organization.
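
To make the collection burden concrete, the sketch below computes the two metrics named above from their standard published definitions: McCabe's V(G) = e - n + 2p for a control-flow graph, and Halstead's effort E = D x V, with volume V = N log2(n) and difficulty D = (n1/2)(N2/n2). These definitions come from the original metric proposals and are not restated in this survey; the function names and the input counts are illustrative assumptions.

```python
import math


def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's V(G) = e - n + 2p for a control-flow graph."""
    return edges - nodes + 2 * components


def halstead_effort(distinct_operators, distinct_operands,
                    total_operators, total_operands):
    """Halstead's effort E = D * V.

    V (volume)     = N * log2(n), with N total tokens and n distinct tokens
    D (difficulty) = (n1 / 2) * (N2 / n2)
    """
    vocabulary = distinct_operators + distinct_operands
    length = total_operators + total_operands
    volume = length * math.log2(vocabulary)
    difficulty = (distinct_operators / 2) * (total_operands / distinct_operands)
    return difficulty * volume


# Hypothetical counts for a small routine.
print(cyclomatic_complexity(edges=9, nodes=8))          # 3
print(halstead_effort(4, 4, 8, 8))                      # 192.0
```

Both metrics require parsing the program into tokens or a control-flow graph, whereas SLOC requires only a line count, which illustrates why the simpler measure is easier to collect.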

2.3.2   Dimensions  of  Software  Complexity 

Stemming  in  part  from  the  results  summarized  above,  some  research  has  focused  on 
attempting  to  identify  unique  dimensions  of  software  complexity,  i.e.,  which  metrics  can 
be  seen  as  relatively  independent  and  thus  may  represent  different  dimensions.    Li  and 
Cheung,  in  a  study  of  255  student  FORTRAN  programs,  collected  data  on  31  separate 
metrics  [Li  and  Cheung  1987].   They  found  that  the  metrics  could  be  roughly  divided  into 
two  groups,  "volume  metrics"  (i.e.,  size)  and  "control  metrics".    Their  recommendation 
was  to  use  a  metric  from  both  groups,  or  to  use  a  hybrid  metric  that  could  capture  elements 
of  both.   A  similar  conclusion  was  reached  by  Wake  and  Henry,  who  investigated  the 
relationship  between  software  metrics  and  the  number  of  LOC  changed  in  a  set  of  193 
modules  of  C  code  [Wake  and  Henry  1988].  They  suggest  that  a  model  with  a  combination 
of  metric  types  predicts  better  than  any  single  metric.   Most  recently  Munson  and 
Khoshgoftaar used factor analysis to isolate two dimensions of complexity which they label
"volume" and "modularity" [Munson and Khoshgoftaar 1990]. They found their
generated metric to be a good predictor of debugging time for a set of 27 FORTRAN
programs. 

This  research  provides  additional  support  to  the  notion  of  using  software  complexity 
metrics  to  predict  maintenance  activity.    It  further  refines  earlier  metric  work  in  noting 
that  a  small  number  of  underlying  dimensions  of  complexity  are  represented  in  the 
literature  by  a  relatively  large  number  of  proposed  metrics.   For  practitioners  the  result  is 
that  they  should  consider  adopting  a  small  set  of  metrics  to  aid  their  management  of  the 
maintenance  process.   For  researchers  the  conclusion  is  that  proposals  for  new  metrics 
must demonstrate both orthogonality to existing metrics and superior performance in
terms  of  predicting  dependent  variables  of  interest. 
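
One simple way to examine whether a set of metrics reflects few underlying dimensions is to ask how much of their joint variance lies along a single direction. The sketch below is a stand-in for the factor analysis used in the studies above, not a reproduction of it: it computes the share of variance captured by the leading principal component of a correlation matrix, using power iteration. All names and sample values are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(metric_columns):
    """Pairwise correlations among a list of metric series."""
    return [[pearson_r(a, b) for b in metric_columns] for a in metric_columns]


def leading_component_share(matrix, iterations=500):
    """Fraction of total variance on the largest eigenvalue.

    Uses power iteration; the trace of a k-by-k correlation matrix is k,
    so the share is eigenvalue / k.
    """
    k = len(matrix)
    # Uneven start vector so it is unlikely to be orthogonal to the
    # leading eigenvector.
    vector = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(k))
                   for i in range(k)]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            raise ValueError("power iteration collapsed to the zero vector")
        vector = [p / norm for p in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(k))
               for i in range(k)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))   # Rayleigh quotient
    return eigenvalue / k


# Two hypothetical metrics with correlation 0.8: one direction carries 90%.
print(leading_component_share([[1.0, 0.8], [0.8, 1.0]]))
```

A share near 1.0 indicates that the metrics are largely redundant, while a share near 1/k indicates that they are close to orthogonal, which is the property the text asks proposers of new metrics to demonstrate.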

3.  COMPREHENSION  RESEARCH 

The  single  critical  factor  that  differentiates  software  maintenance  from  new  software 
development  is  the  software  engineer's  need  to  interact  with  existing  software  and 
documentation.    Therefore  it  is  not  surprising  that  a  significant  amount  of  software 
maintenance  research  has  focused  on  the  issue  of  comprehension.    The  research  described 
in  the  previous  section  on  complexity  metrics  may  also  be  seen  as  applying  to 
comprehension.   This  is  because  the  claim  that  a  program  is  more  error-prone 
because  it,  say,  contains  more  complex  logic  paths  must  rest  on  the  notion  that 
such  a  program  is  harder  for  the  maintainer  to  comprehend  and  therefore  harder  to 
maintain  correctly. 

However,  such  arguments  about  the  impact  of  complexity  on  comprehension  are  only 
indirect  in  that,  even  when  increased  complexity  is  shown  to  be  correlated  with  a  decrease  in 
a  performance  variable,  it  is  only  a  presumption  that  such  effects  are  caused  through 
difficulties  in  comprehending  the  more  complex  artifacts.    This  section  focuses  on  studies 
that  more  directly  address  the  issue  of  comprehension,  through  use  of  dependent  variables 
that  operationalize  comprehension  or  other  types  of  emphasis.   This  issue  has  been 
identified  as  critical  to  the  subject  of  maintenance  for  some  time.    Fjelstad  and  Hamlen 
reported  back  in  the  late  1970's  their  belief  that  more  than  fifty  percent  of  all  software 
maintenance  effort  was  devoted  to  comprehension  [Fjelstad  and  Hamlen  1983].    Dean  and 
McCune,  in  a  survey  of  Air  Force  maintainers,  reported  that  the  top  three  problems  in 
software  maintenance  were  all  comprehension  related:  (1)  a  high  rate  of  personnel 
turnover  requiring  that  unfamiliar  maintainers  work  on  the  systems,  (2)  difficulty  in 
understanding  the  software,  particularly  in  the  absence  of  good  documentation,  and  (3) 
difficulty  in  determining  all  of  the  relevant  places  to  make  changes  due  to  an  inadequate 
understanding  of  how  the  program  works  [Dean  and  McCune  1983].   (See  Table  4.)  Of  the 
work  covered  in  this  review,  two  research  problems  dominate:  the  variation  in  individual 
maintainers'  ability  and  the  efficacy  of  various  aids  to  maintenance  comprehension. 

3.1  Individual  Differences 

One  consistent  empirical  observation  has  been  that  certain  individuals,  often  those 
with  greater  experience,  are  simply  better  at  maintenance  tasks  under  nearly  all  conditions 
than  those  without  such  skills.   In  a  study  whose  main  focus  was  on  the  optimum  amount 
of  program  indentation,  Miara  et  al.  found  that  expert  subjects  (those  with  three  or  more 
years  of  programming  in  school  and/or  more  than  two  years  of  professional 
programming)  outperformed  novices  under  all  conditions  [Miara  et  al.  1983].   Curtis  et  al. 
report  that  in  a  series  of  experiments  involving  professional  programmers,  the  number  of 
years  of  experience  was  not  a  significant  predictor  of  comprehension,  debugging,  or 
modification  time,  but  that  number  of  languages  known  was  [Curtis  et  al.  1989].   They 
suggest  that  this  means  that  breadth  of  experience  may  be  a  more  reliable  guide  to  ability 
than  length  of  programming  experience.   Most  recently,  in  a  study  of  undergraduate 
programmers,  Oman  et  al.  found  that  seniors  outperformed  juniors  who  outperformed 
sophomores  in  all  categories  [Oman  et  al.  1989]. 

All  of  this  research  gives  an  important  message  to  researchers  that  the  ability  and 
experience  levels  of  subjects  in  experiments  must  be  carefully  controlled  for  if  meaningful 
results  are  to  be  obtained.   However,  ultimately  knowing  that  more  experienced 
maintainers  perform  at  a  higher  level  is  only  interesting  if  managers  understand  why  this 
is  so.    For  example,  do  some  individuals'  problem  solving  styles  naturally  lend  themselves 
to  being  good  maintainers,  such  that  they  perform  well,  are  rewarded  appropriately,  and 
stay  to  gain  additional  experience  in  maintenance?   Or,  does  performing  a  lot  of 
maintenance  work  provide  experiential  learning  such  that  all  or  most  software  engineers 
could  eventually  become  good  maintainers?   If  this  were  better  understood  then  managers 
could  take  action  to  (1)  make  more  informed  choices  about  assigning  individual 
maintainers  to  tasks,  and  (2)  improve  conditions  under  which  maintainers  gain  such 
experience  faster,  so  that  less-skilled  maintainers  can  emulate  the  better  performers. 

Two  studies  in  this  review  have  focused  on  constructing  theories  of 
comprehension  from  detailed  observation  of  software  engineers  performing 
maintenance.    Littman  et  al.  videotaped  ten  professional  programmers  as  they  went  about 
doing  a  constructed  maintenance  problem  [Littman  et  al.  1987].   They  identified  two 
generic  strategies  which  they  called  "systematic"  and  "as-needed".   As  the  names  imply, 
maintainers  employing  a  systematic  strategy  attempted  to  construct  a  mental  model  of 
how  the  program  worked,  and  then  used  that  mental  model  in  the  performance  of  their 
maintenance  task.   Others  only  examined  the  program  code  when  necessary  to  check 
specific  hypotheses.   The  systematic  maintainers  were  the  only  ones  who  successfully 
completed  the  maintenance  tasks.    Recently,  Robson  et  al.  have  noted  that  this  finding 
may  be  an  artifact  of  the  small  program  used  in  the  experiment,  and  that  on  large 
programs  this  approach  may  be  infeasible  [Robson  et  al.  1991]. 

Letovsky  videotaped  and  analyzed  verbal  protocols  of  six  professional  programmers 
[Letovsky  1987].    These  verbal  protocols  revealed  micro-level  processes  that  maintainers 
performed  as  well  as  knowledge  types  that  maintainers  sought  out  as  they  went  about  their 
task.   The  author  suggests  that  such  data  will  be  useful  both  to  researchers  in  developing 
cognitive  theories  of  maintenance  and  to  practitioners  in  identifying  what  types  of  aids 
might  be  most  useful  in  supporting  maintenance. 

3.2  Aids  to  Comprehension 

Within  this  area  a  significant  portion  of  the  research  has  been  addressed  to  the  relative 
utility  of  various  aids  to  comprehension,  most  particularly  graphical  versus  text-based  aids. 
Shneiderman  et  al.  in  a  lab  experiment  testing  comprehension  found  that  groups  using 
data  structure  diagrams  outperformed  those  without  such  aids  or  with  control  flow 
documentation  [Shneiderman  1982].    Lehman  conducted  an  experiment  and  found  that 
the  group  equipped  with  graphical  data  structure  diagrams  took  less  time  and  made  fewer 
errors  on  the  same  task  than  a  group  equipped  with  textual  Yourdon-style  data  dictionaries 
[Lehman  1989].   An  experiment  by  Baecker  even  showed  that  graphically  enhanced  text  was  a 
statistically  significant  improvement  over  plain  text  in  a  test  of  comprehension  [Baecker  1988]. 

However,  Ramsey  et  al.  found  that  groups  equipped  with  program 
design  language  documentation  (PDLs)  performed  better  than  flowchart  groups  [Ramsey  et 
al.  1983].  This  study  was  later  criticized  by  the  previously  cited  study  by  Curtis,  et  al.  for 
having  results  that  may  have  been  confounded  by  inadequate  controls  in  the  experimental 
design  with  respect  to  the  experience  level  of  the  programmers  (1989,  pp.  170-171).   In 
particular,  it  may  have  been  the  case  that  the  flowcharts  were  used  by  a  group  that  was,  on 
average,  of  less  ability  than  the  PDL  group.    In  their  own  experiments  Curtis  et  al.  found 
the  choice  of  whether  a  constrained  language  or  ideograms  (symbols)  was  superior  to  be 
somewhat  task-dependent.   However,  natural  language  was  never  found  to  be  a  superior 
format  in  any  of  their  four  experiments. 

The  Curtis  et  al.  experiments,  besides  being  the  most  recent  of  the  studies  reviewed 
here,  also  offer  a  clear  model  for  how  such  experiments  on  comprehension  should  be 
performed.    They  also  provide  a  detailed  review  of  previous  research  on  comprehension, 
and  this  paper  is  recommended  reading  for  researchers  beginning  work  in  this  area.   It 
concludes  with  the  suggestion  that  "Little  additional  research  is  needed  that  compares 
flowcharts  to  a  program  design  language  on  module-level  tasks.     Rather,  attention  needs  to 
be  focused  on  the  context  of  the  documentation,  such  as  different  ways  of  representing  data 
structures  or  state  transitions."  (1989,  p.  202). 

4.  GENERAL  MANAGEMENT  ISSUES  RESEARCH 

While  the  research  in  the  preceding  two  sections  tends  to  be  centered  on  narrowly 
defined  research  questions,  the  work  that  is  grouped  together  here  centers  on  research 
questions  that  are  higher  level  and  more  general  in  nature.    (See  Table  5.)   The  unit  of 
analysis  in  these  studies  is  more  typically  at  the  project  or  system  level,  as  opposed  to 
the  work  in  the  previous  sections,  which  was  much  more  focused  at  the  program  or 
module  level.    Therefore,  while  all  maintenance  research  tends  to  have  implications  for 
management,  the  work  reviewed  in  this  section  generates  conclusions  that  typically 
require  higher  level  management  intervention  if  the  recommendations  are  to  be 
successfully  implemented. 

The  higher  level  unit  of  analysis  is  also  reflected  in  the  fact  that  a  much  higher 
percentage  of  the  studies  reviewed  in  this  section  do  not  report  statistical  test  results,  but 
tend  to  rely  more  on  descriptive  data.   The  two  are,  of  course,  related:  a  larger 
unit  of  analysis  generally  results  in  a  smaller  sample  size,  which  may  be  less  amenable  to 
statistical  analysis. 

Two  main  streams  of  research  are  present  in  this  work.   The  first  focuses  on  the  causes 
of  maintenance  work,  and  seeks  to  prevent  or  reduce  the  need  for  maintenance.   The 
second  focuses  on  the  'repair  vs.  replace'  question,  seeking  to  determine  whether  it  is  more 
cost  effective  to  maintain  an  existing  piece  of  software  or  to  simply  write  a  new  program  to 
replace  it. 

4.1  Causes  of  Maintenance  Activity 

A  survey  of  DPMA  members  by  Lientz  and  Swanson  laid  the  groundwork  for  much  later 
work  in  software  maintenance  [Lientz  and  Swanson  1980;  Lientz  and  Swanson  1981].      They 
present  a  typology  of  maintenance  consisting  of  corrective  (repairs),  adaptive  (change 
accommodation)  and  perfective  (enhancements),  which  has  since  gone  on  to  become  the 
standard  terminology  in  this  area^.   The  approximate  distribution  of  maintenance  work  was 
that  more  than  half  was  perfective,  approximately  one-quarter  was  adaptive,  and  the 
remainder  was  corrective.   Their  survey  respondents  reported  that  user  problems,  specifically 
lack  of  user  knowledge,  were  believed  to  be  a  critical  source  of  maintenance  activity. 


^This  typology  was  first  presented  by  Swanson  in  "The  Dimensions  of  Maintenance"  in  Proceedings  of  the  Second 
International  Conference  on  Software  Engineering.  1976,  pp.  492-497. 

Determined that maintenance consumed about 
half of IS department's staff time.    Six 
problem areas were identified.    Lack of user 
knowledge was a leading source of reported 
problems. 

Maintenance expenditures tend to increase 
with program age.    Redeveloped programs 
(functional replacement) tend to require less 
maintenance expenditures than the originals. 
Poor documentation significantly increases 
maintenance expenditures.     Application 
programs written in assembly language 
require higher maintenance expenditures than 
their higher level counterparts. 

Based on effort requirements for the 
development phase, the Rayleigh curve 
seems to work to predict effort requirements 
for corrective maintenance, but significantly 
underestimates effort when adaptive and 
perfective maintenance are included. 

Significant portions of the code were proven 
to be reusable, and subsequent evaluation of 
the actual effects of reuse led to estimates of 
an increase in productivity of up to 50%. 
Reuse also leads to simplification, which 
should be of major benefit to maintenance 
efforts by aiding programmer 
comprehension. 

A nineteen month longitudinal study 
suggested that significant difference existed 
in maintainers' performance levels. 

Provides detailed data regarding causes of 
modifications and errors.    Unplanned design 
modifications and small, easily corrected 
errors were most common. 

Detailed longitudinal data of 6 years of 
system updates to a set of 5 medical systems. 
Systems proved relatively stable and were 
able to be maintained by a relatively small 
staff, believed due, in part, to the application 
generator used. 

Lin  and  Gustafson  further  investigated  the  distribution  of  work  by  examining  before 
and  after  versions  of  two  COBOL  systems  [Lin  and  Gustafson  1988].  The  combined 
percentage  of  perfective  and  corrective  maintenance  activity  was  greater  than  seventy 
percent  in  one  case  and  greater  than  ninety  percent  in  the  other.   Adaptive  maintenance  was  only 
approximately  ten  percent,  and  a  number  of  new  categories  (e.g.,  adding  and  deleting 
comments)  all  represented  small  percentages  of  the  work. 

Weiss  and  Basili  did  a  detailed  investigation  of  the  change  data  from  three  systems  at 
the  Software  Engineering  Laboratory  [Weiss  and  Basili  1985].   They  found  that 
approximately  forty  percent  of  changes  were  to  correct  errors.  Their  data  did  not  support 
some  conventional  wisdom  in  software  engineering;  for  example,  interfaces  did  not 
appear  to  be  particularly  problematic,  and  most  corrections  were  small  changes  in  only  one 
location. 

Additional  work  in  this  area  would  be  useful  in  better  understanding  how 
maintainers  actually  spend  their  time.   In  particular,  it  may  be  time  to  develop  a  finer- 
grained  taxonomy  that  further  develops  the  three  types  of  activities  first  proposed  by 
Swanson.    Beyond  this  documentation  of  effort  distribution,  analysis  linking  patterns  in 
the  distribution  of  maintenance  work  could  suggest  improvements  in  the  initial 
development  process  that  would  reduce  later  expenditures  on  maintenance.    For  example, 
lower  than  average  amounts  of  corrective  maintenance  and/or  easier  (less  expensive) 
adaptive  and  perfective  maintenance  might  be  associated  with  systems  developed  with 
certain  modern  development  practices.    Systems  with  higher  levels  of  software  re-use  may 
be  associated  with  lower  levels  of  corrective  maintenance. 

4.2  Repair  versus  Replace 

One  relatively  unsettled  question  is  how  the  distribution  of  work  may  change  over 
time  as  systems  age.   Guimaraes  observed  that  successive  program  changes  tend  to 
complicate  the  logical  flows  of  the  program  and  to  render  program  documentation 
obsolete,  thus  increasing  maintenance  expenditures  [Guimaraes  1983].   Lientz  and 
Swanson  [Lientz  and  Swanson  1981]  agree  that  maintenance  costs  increase  with  program 
age,  but  offer  results  that  suggest  that  the  increases  may  be  avoidable  through  managerial 
action:   "Though  system  size  and  age  are  seen  to  be  strongly  associated  with  the  problems  of 
maintenance,  this  association  was  shown  in  subsequent  analysis  to  be  explainable  in  terms 
of  other,  intervening  variables,  viz.  magnitude  and  allocation  of  maintenance  effort  and 
the  relative  development  experience  of  maintainers  of  the  system." 

If  the  effects  of  age  on  software  were  better  understood,  then  this  could  offer  insight 
into  the  question  of  when  to  replace  rather  than  repair  (maintain)  a  given  software 
component.    Most  of  the  data  collected  so  far  suggest  that  modification  is  more  expensive 
than  is  commonly  believed,  and  that  the  development  cost  savings  of  using  modified 
modules  may  pale  in  comparison  to  the  later  costs  of  maintaining  the  resulting  system 
[Bowen  1983]  [Basili  and  Perricone  1984].9 

Bowen  analyzed  error  data  from  a  large  (6000  module)  Hughes  air  defense  project  and 
determined  that  a  composition  of  a  balanced  mixture  of  new  and  lifted  (modified  from 
existing  code)  software  (e.g.  35/65  to  75/25)  is  nearly  four  times  as  error-prone  as  a 
composition  of  extremely  unbalanced  mixtures  of  new/lifted  software  (e.g.  15/85  or  90/10). 
This  implies  that  if  one  is  planning  to  utilize  pieces  of  an  existing  system,  one  should 
either  use  it  sparingly  in  a  new  system,  or  use  it  nearly  completely  intact.   If  large  scale 
modifications  are  planned,  it  seems  much  more  efficient  to  design  from  scratch  to  avoid 
the  prohibitive  maintenance  costs  of  problem  fixes  associated  with  reuse.    Supporting  this 
view  is  the  study  by  Card  et  al.  where  problem  fixes  required  ten  times  the  effort  of 
developing  new  code  [Card  et  al.  1987].^10 

While  they  did  not  specify  optimal  blends  of  new  and  modified  modules  in  the 
construction  of  new  systems,  Basili  and  Perricone  concluded  that  adapted  modules  taken 
from  other  systems  were  more  expensive  to  maintain  [Basili  and  Perricone  1984].   One 
factor  that  may  have  contributed  to  this  result  is  that  they  also  determined  that  most  of  the 
errors  in  the  systems  they  analyzed  were  due  to  incorrect  or  misinterpreted  functional 
specifications,  and  the  single  largest  error  types  were  those  involving  interfaces.   Modules 
borrowed  from  other  systems  are  likely  to  be  less  comprehensible  to  programmers  than  the 
newly  designed  code,  and  thus  would  be  especially  prone  to  these  types  of  errors. 

In  seeming  contrast  to  the  above,  Lanergan  and  Grasso  offer  evidence  of  a  large  scale 
success  in  the  reuse  of  software  components  [Lanergan  and  Grasso  1984].  They  examined 
5000  source  COBOL  programs  at  Raytheon,  and  identified  redundant  sections  of  code  that 
were  prime  candidates  for  standardization.   Subsequent  evaluation  of  the  actual  effects  of 
using  these  standardized  functional  modules  led  to  estimates  of  an  increase  in 
productivity  of  up  to  50%.   The  examples  cited  were  simple,  however,  comprising  routines 


^9Not  all  data  collected  on  this  topic  are  in  agreement,  however.  In  a  study  of  65  COBOL  maintenance  projects  it 
was  found  that  the  costs  associated  with  modified  lines  of  code  were  approximately  equal  to  those  of  new  lines  [Banker 
et  al.  1987]. 

^10This  is  also  related  to  some  interesting  theoretical  work  done  by  Gode  et  al.,  whose  model  results  suggest, 
among  other  propositions,  the  somewhat  surprising  conclusion  that  the  optimal  time  within  which  to  replace 
larger  systems  is  shorter  than  that  for  smaller  systems  [Gode  et  al.  1990]. 

to  perform  date  conversions,  part  number  validations,  or  data  field  edits.    Reuse  of 
relatively  atomic  functions  such  as  these  has  proved  effective,  but  the  advantages  may  not 
carry  over  quite  as  well  to  modules  with  more  complex  functions  and  interfaces. 

Further  research  into  when  the  benefits  of  reusability  are  offset  by  the  cost  to  modify 
seems  warranted,  as  do  more  longitudinal  studies  that  document  how  systems  evolve. 

5.  METHODOLOGICAL  ISSUES  IN  EMPIRICAL  RESEARCH  IN  SOFTWARE 
MAINTENANCE 

One  advantage  of  a  review  that  examines  so  many  years  of  research  is  that  it  permits 
some  observations  to  be  made  about  meta-issues.   One  issue  that  has  already  been  raised  is 
the  sheer  dearth  of  research  in  this  area.    A  second  such  issue  is  the  set  of  methodological  concerns 
in  the  research.    Two  main  topics  merit  discussion  here:  the  choice  of  methodologies  and 
the  care  with  which  research  is  conducted. 

5.1  Methodological  Choice 

Proponents  of  alternative  research  methodologies  seem  somewhat  inclined  to  criticize 
other  approaches  rather  than  simply  benefiting  from  assimilating  those  findings  into  their 
own  work.    A  common  division  is  between  those  who  conduct  field  research  (typically 
field  studies  rather  than  field  experiments)  and  those  who  conduct  experiments  (typically 
laboratory  experiments  rather  than  field  experiments).    The  experimentalists  emphasize 
the  need  to  find  causes  of  behaviors  and  often  complain  about  the  lack  of  a  theoretical  base 
in  some  field  studies.   For  example,  Soloway  and  Ehrlich,  at  the  end  of  an  article  describing 
their  experiments,  note  "More  importantly,  our  approach  is  to  provide  explanations 
(emphasis  in  original)  for  why  a  program  may  be  complex  and  thus  hard  to  comprehend. 
Towards  this  end  we  have  attempted  to  articulate  the  programming  knowledge  that 
programmers  have  and  use.    Thus,  our  intent  is  to  move  beyond  correlations  (emphasis  in 
original)  between  programmer  performance  and  surface  complexity  as  measured  by 
Halstead  metrics,  lines  of  code,  etc.,  to  a  more  principled,  cognitive  explanation."  [Soloway 
and  Ehrlich  1984]. 

On  the  other  side,  field  researchers  often  complain  about  the  lack  of  external  validity  of 
most  lab  experiments  which  typically  use  student  programmers  and  small  programs.   For 
example,  Conte  et  al.  note  "The  results  from  controlled  experiments  which  will  be 
discussed  later,  are  usually  limited  by  economic  constraints  to  small  projects  by  individual 
programmers,  and  are  usually  performed  only  in  universities.     Such  results  are  useful  in 
providing  insights  to  certain  parameters  of  the  programming  process,  but  are  not  normally 
generalizable  to  team  programming  and  large  projects,  which  are  common  in  industry." 
[Conte  et  al.  1986]. 

Both  of  these  statements,  while  true,  emphasize  the  shortcomings  of  alternative 
research  methodologies  without  conveying  the  notions  (1)  that  difficult  research  problems 
such  as  those  being  investigated  in  this  research  are  likely  to  benefit  from  attack  by 
dissimilar  methods,  and  (2)  that  given  the  current  shortage  of  research  in  this  area,  almost 
all  published  research  is  providing  positive  marginal  contribution.    It  would  seem 
appropriate  for  researchers  to  attempt  to  assimilate  the  findings  from  the  other  streams 
into  their  own  work  so  that  all  groups  would  move  ahead.   Only  a  very  small  number  of 
field  experiments  have  been  reported,  and  some  of  these  have  been  criticized  as  not  being 
done  as  well  as  they  might  have  [General  Services  Administration  1987;  Zvegintzov  1988]. 
As  it  stands  now,  a  review  of  Tables  2,  3,  and  4  reveals  that  problems  and  methodologies 
are  tightly  linked,  e.g.,  complexity  metrics  work  is  almost  entirely  field  study  based  and 
comprehension  work  is  almost  entirely  laboratory  experiments.    While  to  a  certain  degree 
this  bias  is  natural  and  appropriate,  given  the  topics  studied,  over-reliance  on  a  subset  of 
research  tools  may  hinder  progress.   What  may  be  required  is  collaboration  among 
maintenance  researchers  who  reflect  different  traditions  and  who  possess  complementary 
research  skill  sets. 

5.2  Methodological  Rigor 

Empirical  work  in  software  engineering  in  general  (not  just  maintenance)  has 
sometimes  been  criticized  for  lack  of  methodological  rigor,  e.g.  [Kearney  et  al.  1986].  Work  in 
this  area  suffers  from  a  number  of  handicaps  owing  to  the  difficulty  of  the  research 
problem:  the  large  number  of  potential  factors  to  model,  the  absence  of  standard 
definitions  for  dependent  and  independent  variables,  and  the  lack  of  large  and/or  readily 
available  data  sets  to  analyze. 

Unfortunately,  these  limitations  are  sometimes  overlooked,  or  at  least  not 
acknowledged,  by  researchers.   A  recent  summary  of  a  set  of  thirteen  general  criticisms  has 
been  provided  by  MacDonell,  where  he  notes  deficiencies  in  such  areas  as  experimental 
method  and  design,  data  collection,  and  statistical  analysis  and  interpretation  [MacDonell 
1991]. 

One  particular  point  of  MacDonell's  that  is  borne  out  by  the  data  collected  in  this 
review  and  is  highlighted  in  the  tables  is  the  (over)-reliance  on  Pearson  correlation 
[MacDonell  1991,  pp.  146-147].   One  concern  is  the  sometimes  casual  manner  in  which 
researchers  move  from  what  are  often  exploratory  correlation  results  to  claims  of 
causation.     Kearney  et  al.  note  "When  large  numbers  of  differing  experimental  conditions 
are  examined,  the  likelihood  of  finding  accidental  relationships  is  high.     The  unfortunate 
consequence  of  this  practice  is  a  substantial  inflation  of  the  probability  of  making  a  type  I 
error  -  inferring  the  existence  of  a  non-existent  relationship."    (page  1048)    This  concern 
seems  worthy  of  repeating,  especially  in  light  of  a  recent  trend  observed  in  the  tables 
towards  greater  use  of  exploratory  factor  analysis  in  software  engineering  maintenance 
research. 
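
The inflation that Kearney et al. describe can be made concrete with a short calculation. The Python sketch below assumes that the tests are independent and are all run at the same significance level, conditions that exploratory analyses rarely satisfy exactly. The function names are hypothetical, and the Bonferroni adjustment is shown only as one standard remedy; it is not a procedure proposed in the studies reviewed here.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one Type I error across num_tests tests.

    Assumes the tests are independent and each uses significance level alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    # P(no false positive in any test) = (1 - alpha) ** num_tests
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test significance level that keeps the family-wise rate <= alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


if __name__ == "__main__":
    # Hypothetical scenario: 20 metric/outcome correlations, each tested at 0.05.
    for k in (1, 5, 20):
        print(k, round(familywise_error_rate(0.05, k), 4))
```

Under these assumptions, the chance of at least one spurious "significant" relationship grows quickly with the number of correlations examined, which is why exploratory results of this kind need confirmation on independent data.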

A  more  general  concern  is  the  extensive  use  of  parametric  statistical  methods,  such  as 
Pearson  correlation,  whose  proper  use  includes  an  understanding  of  the  method's 
distributional  assumptions.    Shepperd  provides  a  very  relevant  example  of  where  such 
assumptions  are  violated  -  the  use  of  the  number  of  errors  as  a  dependent  variable 
[Shepperd  1988].   Clearly,  this  can  never  be  negative,  and  therefore  at  best  this  distribution 
is  truncated  normal,  yet  such  concerns  are  rarely  acknowledged  by  authors.   Two 
exceptions  from  this  review  worthy  of  emulation  by  other  researchers  are 
acknowledgements  by  Curtis  et  al.  and  Woodfield  et  al.: 

"In  using  ANOVA,  we  assume  that  the  values  of  the  dependent  variable  are  normally 
distributed.  Unfortunately,  this  is  typically  not  the  case  with  response-time  measures.  For 
most  response-time  measures,  the  variance  is  proportional  to  the  mean,  since  many  of  the 
values  are  near  zero  and  the  distribution  is  positively  skewed.  For  all  the  analysis  reported 
in  experiment  1,  a  logarithmic  transformation  was  applied  to  the  response  time  to 
attenuate  the  influence  of  extreme  scores  and  produce  a  more  normal  distribution..."  [Curtis 
et  al.  1989]. 

"The  most  common   correlation   measure   is   the  Pearson   product-moment  correlation 
coefficient,  which  requires  that  data  be  from  interval  scales  with  underlying  normal 
distributions,  with  the  sets  of  data  being  correlated  having  nearly  equal  variance...  some 
models  yield  outlier  estimates  that  do  not  meet  the  normal  distribution  assumption. 
Thus,  we  also  use  the  Spearman  rank  correlation  coefficient  to  determine  how  well 
estimates  of  programming  times  relate  to  actual  programming  times."    [Woodfield  et  al. 
1981b] 

Despite  the  ease  in  doing  so,  such  acknowledgements  are  rare  in  this  literature.   In 
general,  for  much  of  the  empirical  research  in  software  maintenance  it  would  seem  that 
greater  use  of  non-parametric  (distribution  free)  statistical  tests  would  be  appropriate. 
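
As an illustration of why a rank-based alternative can matter, the following Python sketch computes both the Pearson and the Spearman coefficient on invented data containing one extreme value. The data and the function names are assumptions made for this example; they are not taken from Woodfield et al. or from any other study reviewed here.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Invented data: estimated vs. actual programming times, one outlier.
    estimated = [1, 2, 3, 4, 5, 6]
    actual = [2, 3, 5, 8, 13, 400]
    print("Pearson: ", pearson(estimated, actual))
    print("Spearman:", spearman(estimated, actual))
```

On data like these the Spearman coefficient equals 1, because the ordering of the two series is perfectly preserved, while the Pearson coefficient is pulled well below 1 by the single extreme value.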

6.  CONCLUDING  REMARKS 

The  first  broad  conclusion  from  this  review  and  analysis  of  empirical  research  in 
software  maintenance  is  that  the  area  has  been  understudied  relative  to  its  practical 
import.    It  confirms  Schneidewind's  observation  that  the  software  engineering  field  needs 
to  reassess  its  priorities  with  regard  to  research  topic  selection  and  devote  more  attention 
to  maintenance. 

In  terms  of  specific  research  areas  covered,  this  review  noted  four  broad  areas  of 
coverage:  (1)  software  modularity  and  structure,  (2)  general  software  complexity  metrics,  (3) 
software  comprehension,  and  (4)  general  management  issues.    This  section  focuses  on 
discussing  suggestions  for  future  research,  and  these  recommendations  are  summarized  in 
Table  6  which  appears  at  the  end  of  this  section. 

A  great  deal  of  work  has  been  directed  at  determining  the  benefits  of  modularity,  with 
the  most  recent  work  suggesting  that  there  is  an  optimum  level  in  each  environment  that 
can  be  discovered  through  the  use  of  statistical  models.   Further  work  to  confirm  this 
finding  and  to  determine  the  range  of  values  and  determinants  of  the  differences  would  be 
useful,  and  could  eventually  lead  to  the  development  of  local  standards  for  proper  practice. 
There  has  been  less  work  on  the  issue  of  inter-module  coupling,  but  all  of  the  results  argue 
for  greater  emphasis  on  reducing  coupling  when  possible.   There  is  some  limited  evidence 
that  the  "ripple  effects"  caused  by  the  propagation  of  errors  through  coupling  are  more 
expensive  to  correct  than  primary  errors,  but  further  work  on  this  topic  seems  necessary. 

Considerable  effort  has  gone  into  correlating  complexity  metric  scores  with  increased 
effort,  errors,  changes  or  all  three,  and  it  seems  clear  that  strong  relationships  do  exist. 
What  also  seems  clear  is  that  many  complexity  metrics  measure  the  same  dimension,  e.g., 
program size. Therefore, in the absence of some other compelling argument, the
publication criterion for new metrics must be that they be shown to be sufficiently
orthogonal to existing measures. That is, complexity metrics need to be shown to be
adding value beyond representing size.^^ It has also been suggested that systems grow in
complexity as they age, but why this may be true is not well-documented. There is a need
for more longitudinal studies that can reflect a system's status at various points in its life.
Most useful would be studies that track all phases of the life cycle (including analysis and
design) so that investigations could be done to determine the effects on subsequent
maintenance requirements caused by using different techniques and emphases during the
earlier phases.

^^A pilot study in this regard is the work in cyclomatic complexity density [Gill and Kemerer 1991].
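
The orthogonality criterion discussed above can be checked with a simple correlation, as in the following minimal sketch. It compares the correlation with program size of a raw complexity count against that of a size-normalized version of the same count. The module data are invented for illustration, and treating a density as complexity divided by lines of code is an assumption of this sketch rather than a definition taken from the studies surveyed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements for six modules (not data from any study).
size = [120, 300, 450, 800, 1500, 2200]   # lines of code
cyclomatic = [14, 30, 50, 70, 160, 200]   # raw complexity count

# Size-normalized variant of the same metric (assumed definition).
density = [c / s for c, s in zip(cyclomatic, size)]

r_raw = pearson(size, cyclomatic)
r_density = pearson(size, density)

print(f"raw metric vs. size:     r = {r_raw:.2f}")
print(f"density metric vs. size: r = {r_density:.2f}")
```

A raw metric whose correlation with size is close to one adds little beyond size itself, whereas a normalized metric with a weaker correlation is at least a candidate for capturing a separate dimension. Whether it predicts maintenance effort would still have to be established empirically.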

A  significant  amount  of  research  activity  has  been  devoted  to  the  issue  of  maintainer 
comprehension  of  existing  source  code  and  documentation.    Wide  individual  variations 
in  performance  have  been  noted  by  many  researchers.   One  laboratory  finding  on  this  topic 
is  that  a  systematic  approach  to  performing  maintenance  tasks  appears  more  effective  than 
the  technique  of  referencing  the  code  only  as  needed  for  each  step  in  performing  the  task. 
Further  work  is  required  both  to  validate  this  finding  and  to  discover  other  habits  of  good 
maintainers  so  that  these  techniques  can  be  further  routinized  and  taught  to  new 
maintainers.   A  second  finding  in  this  area  is  that  graphical  aids  seem  to  be,  on  the  whole, 
as good as or better than text-based documentation. With the increasing availability of
easy-to-use software for generating this documentation, this would appear to be an
inexpensive recommendation for managers to adopt.

In terms of higher-level managerial issues, two foci were noted: the causes of
maintenance and the question of repair versus replacement. The data on the causes of
maintenance  are  somewhat  mixed,  and  do  not  always  represent  consistent  or  sufficiently 
detailed  definitions.   It  will  be  extremely  difficult  to  evaluate  the  impact  of  improved 
practices,  in  design  or  elsewhere,  if  accurate  tracking  of  the  scope  and  origin  of 
maintenance  requests  cannot  be  done.    More  work  needs  to  be  done  to  track  maintenance 
work  in  practice,  in  part  to  support  the  aforementioned  need  for  more  longitudinal  data. 

The  repair/replace  issue  is  often  discussed,  but  is  difficult  to  research.   Some  research 
suggests  that  repair  is  more  expensive  than  new  development,  but  research  in  the  software 
reuse  literature  suggests  that  significant  savings  can  be  achieved  through  code  reuse. 
Savings depend on the degree to which the reused code needs to be modified, but little is
known even about how to measure this phenomenon.

In terms of methodological issues, greater emphasis should be placed on using
multiple, diverse research methods to address the large number of remaining research
issues. A number of authors caution empirical researchers in software maintenance,
particularly new ones, against borrowing techniques, particularly statistical tools, from
other disciplines without examining the assumptions necessary to apply them
appropriately.

It is important to step back from the existing studies and attempt to determine
what is missing or at least neglected. One common concern about documentation not
addressed by laboratory studies is that in practice maintainers often do not use it at all,
regardless of format, perhaps because they do not trust that it has been kept consistent with
the existing system. Researchers and vendors in new systems development need to
address this issue by making automatic generation and update of documentation a feature
of their new tools, lest the potential comprehension gains of proper formatting of such
documentation be wasted.

An  area  of  research  that  is  conspicuous  by  its  absence  is  work  on  the  organizational 
aspects  of  software  maintenance.    Work  on  comprehension  focuses  narrowly  on  an 
individual's  approach  to  a  piece  of  code  and  work  on  complexity  metrics  tends  to  ignore 
the  maintainer  completely.    In  practice  there  is  considerable  influence  from  the 
organizational  environment  in  terms  of  the  presumed  undesirability  of  maintenance 
work  and  the  subsequent  likely  effects  on  morale  and  performance. ^2  While  several 
academic  studies  surveyed  here  mention  this  in  passing,  with  the  exception  of  recent  work 
by  Swanson  and  Beath,  none  address  the  organizational  component  [Swanson  and  Beath 
1989a;  Swanson  and  Beath  1990].   It  seems  likely  that  the  organizational  effects  on 
performance  are  at  least  as  great  as  those  that  have  been  studied  in  detail,  such  as  work  on 
documentation  formats. 

For  example,  is  poor  performance  in  maintenance  a  result  of  low  morale  of  the 
maintainers?    Is  maintenance's  low  occupational  status  in  the  software  engineering 
community  a  function  of  the  common  practice  of  assigning  relatively  inexperienced  staff 
members  to  this  role?   And,  in  turn,  how  does  the  use  of  these  junior  staff  members 
contribute  to  poor  performance?   Do  the  benefits  of  assigning  software  engineers  with 
higher  levels  of  experience  to  maintenance  outweigh  the  possibly  increased  cost  of 
turnover? These are difficult research questions to operationalize and test in the field.
Collaborative research projects between organizationally-oriented researchers and more
traditional software engineering researchers would be appropriate, since the respective
interests and skills of each could lead to some very interesting and carefully
researched results.

In  general,  software  maintenance  is  likely  to  gradually  evolve  into  a  better  understood 
activity,  but  there  are  economic  advantages  to  speeding  this  process.   As  software  managers 
recognize  the  importance  of  the  maintenance  process,  more  resources  can  be  allocated  to 
improve  it.   This  gradual  realization  of  importance  may  help  alleviate  the  possible  stigma 
and  morale  problems  associated  with  maintenance  work,  and  is  crucial  to  promoting 
further  research. 


^^Schneidewind likens working in maintenance to "having bad breath" [Schneidewind 1987].

Because so little theory currently exists, it remains important that research be
empirically  driven  in  order  to  record  the  observations  that  will  lead  to  greater  theory 
development  in  this  area.   An  obstacle  faced  by  researchers  is  the  difficulty  in  obtaining 
good data to analyze. Data collected from field studies are often incomplete, and can be
inaccurate, depending on how well the constraints that ensure consistent data reporting
are enforced. In addition to inaccuracies, organizations may be reluctant to
release  what  they  may  view  as  proprietary  data.  This  has  been  suggested  as  one  of  the 
causes  for  the  emphasis  in  the  research  literature  on  maintenance  tasks  being  done  in  an 
academic  or  military  setting  [Hale  and  Haworth  1988].   One  solution  to  this  problem  may 
be  the  establishment  of  "software  maintenance  research  databases"  where  data  could  be 
contributed  by  organizations  under  the  agreement  that  a  neutral  party,  such  as  a 
university-affiliated  research  center,  would  maintain  the  anonymity  of  the  individual 
contributions. 

In  order  to  facilitate  such  industry  cooperation  and  therefore  an  increase  in  the 
quantity  of  maintenance  research,  studies  need  to  be  conducted  with  an  eye  towards  how 
the  results  can  be  eventually  utilized  by  maintenance  managers.   As  managers  acquire  the 
skills  to  use  metrics  effectively  and  begin  to  benefit  from  software  maintenance  research, 
they  will  be  increasingly  willing  to  encourage  further  studies. 

Lastly,  tools  for  metric  collection  have  historically  been  constructed  by  the  researchers 
as  needed,  and  were  not  readily  available.   More  recently,  automated  tools  have  come  on 
the  market  and  it  is  expected  that  as  data  collection  becomes  easier,  more  data  will  be 
available to analyze and more research will be conducted. As automated metric-gathering
tools become more widely available commercially, validation research that applies
metrics to different environments will become much easier and the quantity of
research should increase. This validation research needs to be coordinated, correlating the
measurement observations from a wide variety of metrics and environments. With these
common definitions, better tools, and greater sharing of data, significant progress can be
expected in the next decade.

Table 6: Summary Recommendations for Future Empirical Maintenance Research

Software modularity and structure

1. More work on determining optimal levels of modularity
2. More work on effects of coupling minimization techniques
3. More work on relationship between coupling and ripple errors

General software complexity metrics

1. Less work on new metrics that have high correlations with existing metrics
2. More experimentation with regard to impacts of complexity on performance

Software comprehension

1. More work on developing measures of maintainer ability and experience
2. More work on impact of experience on performance
3. More work on how documentation is used (or not) in the field

General management issues

1. More work on a finer grained taxonomy of maintenance activities
2. More work on linking maintenance tasks to earlier lifecycle phase activities
3. More work on documenting modification costs and relationship with reuse
4. More work on organizational issues, including morale and turnover